Note: You can use this file as your ‘working document’ where you can try out various investigation ideas and keep notes about your findings. How you use and structure this file is up to you. It is recommended that you keep notes about what you are investigating and what you find, as this will make the process of creating your presentation and report easier. Please note that you do not need to submit this file as part of your group project.
options(repos = c(CRAN = "https://cloud.r-project.org"))
library(tidyverse)
# Add any other libraries here
library(dplyr)
library(tidyr)
Analysis Steps
Load the Dataset: imported from the CSV file; summary statistics reviewed.
Data Cleaning: handled missing values, removed features with excessive missing data, and standardized the data for regression.
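The two cleaning steps above (dropping features with excessive missing data, then imputing the rest) can be sketched in base R on a toy data frame; the column names and the 50% threshold here are illustrative assumptions, not taken from the analysis code below.

```r
# Toy illustration of the cleaning steps (illustrative column names):
# 1) drop features with excessive missing data, 2) mean-impute the rest.
toy <- data.frame(
  a = c(1, NA, 3, 4),     # a little missing data: impute
  b = c(NA, NA, NA, 10),  # 75% missing: drop
  c = c(2, 4, 6, 8)       # complete
)

# Drop columns with more than 50% missing values
keep <- colMeans(is.na(toy)) <= 0.5
toy <- toy[, keep, drop = FALSE]

# Replace remaining NAs with the column mean
for (col in names(toy)) {
  toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
}

toy  # "b" is gone; toy$a[2] is now (1 + 3 + 4) / 3
```

The same imputation is done later in this document with `mutate(across(...))`; the loop version above just makes the logic explicit without any packages.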
What kind of data does it contain?
The dataset includes over 120 variables, mostly numerical, grouped into several categories:
1. Crime-related variables. Examples: ViolentCrimesPerPop (the main target: violent crimes per 100K population), population. These describe the crime level in each community.
2. Socioeconomic attributes. Examples: medIncome (median income), pctWWage. These capture the economic and educational environment.
3. Demographic attributes. Examples: racePctWhite, racepctblack, racePctAsian. These describe the population composition of each community.
4. Family characteristics. Examples: PctIlleg, PctKids2Par. These reflect family structure and child well-being.
5. Housing characteristics. Examples: HousVacant, MedRent. These capture living conditions and housing costs.
6. Law enforcement attributes. Examples: LemasSwornFT, PolicPerPop. These describe police department resources and characteristics.
7. Identifier variables: state, county, community, communityname, fold. These exist in the original file but are usually removed before modeling.
Problem Statement: The goal of this project is to investigate the factors that are most strongly associated with violent crime rates across U.S. communities using the UCI Communities and Crime dataset. This dataset contains a wide range of socioeconomic, demographic, housing, and policing attributes, all of which may contribute differently to community crime levels.
The primary objectives are:
1. To identify patterns and relationships between community characteristics and violent crime.
2. To determine which variables are the strongest predictors of violent crime rates.
3. To develop and evaluate predictive models (e.g., linear regression and random forest) capable of estimating violent crime rates using community-level attributes.
A secondary objective is to reflect on ethical considerations, such as the potential risks of using demographic variables in crime prediction models.
Exploratory Data Analysis (EDA): visualized relationships between features and crime rates; identified patterns and key features.
Regression Modeling:
- Linear Regression: modeled the continuous crime rate from the community features.
- Logistic Regression: converted the crime rate into a binary outcome (e.g., high/low) for logistic analysis.
Evaluation: analyzed model performance metrics (e.g., accuracy, R-squared, confusion matrix).
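The high/low conversion for logistic regression described above can be sketched on simulated data; the variable names, the simulated relationship, and the median cut-off below are illustrative assumptions, not taken from the analysis code in this document.

```r
# Sketch: binarize a continuous rate and fit a logistic regression (base R).
set.seed(1)
df <- data.frame(poverty = runif(200))
# Simulate a crime rate loosely driven by poverty, clipped to [0, 1]
df$crime <- pmin(1, pmax(0, 0.3 * df$poverty + rnorm(200, 0, 0.1)))

# Binarize: "high" = above the median crime rate
df$high_crime <- as.integer(df$crime > median(df$crime))

# Logistic regression on the binary target
logit_model <- glm(high_crime ~ poverty, data = df, family = binomial)

# Classify with a 0.5 probability threshold and check accuracy
pred <- as.integer(predict(logit_model, type = "response") > 0.5)
accuracy <- mean(pred == df$high_crime)
accuracy
```

A median split gives balanced classes, so plain accuracy is meaningful; with a different cut-off (e.g., top quartile = "high") a confusion matrix or precision/recall would be more informative.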
Classic usage scenarios: In crime analysis and prediction, the classic application of the Communities and Crime dataset is predicting community crime rates with regression models.
Researchers use its socio-economic, law enforcement, and demographic variables to identify the key factors influencing crime rates and to make predictions. This analysis not only helps explain the drivers of crime but also gives policymakers data to support more effective prevention strategies.
Academic relevance: The Communities and Crime dataset addresses a fundamental issue in criminology research: how to quantify and predict community crime rates. By integrating multi-dimensional socio-economic and demographic data, it gives researchers a rich resource for exploring the complex relationships between crime rates and social factors. This drives research on crime prediction models and offers a scientific basis for decisions in social policy and public safety.
Practical application: In practice, the Communities and Crime dataset is widely used in urban planning and public safety management. For instance, local governments and law enforcement agencies can use its analysis results to optimize resource allocation and enhance community security. Non-profit organizations and community groups can likewise use the data to design targeted social intervention programs that reduce crime rates and improve the community environment.
| Type | What to Add | Why It Improves Your Project |
|---|---|---|
| 1️⃣ | Correlation heatmap | Visualize variable relationships clearly |
| 2️⃣ | Distribution plots | Show how target and predictors vary |
| 3️⃣ | Variable selection (feature reduction) | Simplify model and improve interpretability |
| 4️⃣ | Cross-validation | Make model evaluation more robust |
| 5️⃣ | Residual analysis | Check model errors for bias |
| 6️⃣ | Map or geographic visualization (optional) | Show crime rates per area (if you find state/county info) |
| 7️⃣ | Model comparison plot | Visually compare models |
| 8️⃣ | Deeper ethics discussion | Shows awareness of real-world implications |
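Item 4 in the table, cross-validation, can be sketched in base R on simulated data; the caret package (loaded later in this document) provides the same idea via `trainControl(method = "cv")`, but the manual version below makes the mechanics explicit. All names and the simulated relationship are illustrative.

```r
# Sketch: 5-fold cross-validation of a linear model (base R, simulated data).
set.seed(42)
n <- 100
sim <- data.frame(x = runif(n))
sim$y <- 2 * sim$x + rnorm(n, 0, 0.2)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
rmse <- numeric(k)

for (i in 1:k) {
  train_i <- sim[folds != i, ]  # fit on k-1 folds
  test_i  <- sim[folds == i, ]  # evaluate on the held-out fold
  fit  <- lm(y ~ x, data = train_i)
  pred <- predict(fit, newdata = test_i)
  rmse[i] <- sqrt(mean((pred - test_i$y)^2))
}

mean(rmse)  # average out-of-fold error
```

Averaging the out-of-fold RMSE is more robust than the single 80/20 split used later, because every observation is used for testing exactly once.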
Ethical Considerations
The Communities and Crime dataset contains sensitive attributes such as race, income, and family structure.
These variables can reflect systemic inequalities and should be treated carefully:
- Avoid using models like these for individual prediction or policing.
- Be aware of bias amplification: if biased data (e.g., due to unequal policing) is used to train a model, its predictions will also be biased.
- Focus on understanding community-level patterns and informing equitable policy decisions rather than punitive actions.
Ethical data science means asking not just “Can we predict it?” but also “Should we?”.
# Load column names
names_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names"
name_lines <- readLines(names_url)
name_lines <- name_lines[grepl("^@attribute", name_lines)]
col_names <- gsub("^@attribute\\s+", "", name_lines)
col_names <- sub("\\s+.*$", "", col_names)
col_names <- col_names[col_names != ""]
# Load dataset
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data"
crime <- read.csv(data_url, header = FALSE, na.strings = "?", col.names = col_names)
# Clean dataset
crime <- crime %>%
select(-state, -county, -community, -communityname, -fold)
# Replace missing values with column means
# Note: the LEMAS/police variables are mostly missing, so after mean
# imputation they are nearly constant (see the identical quartiles below).
data_clean <- crime %>%
mutate(across(where(is.numeric),
~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Summarize the cleaned data
summary(data_clean)
## population householdsize racepctblack racePctWhite
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.01000 1st Qu.:0.3500 1st Qu.:0.0200 1st Qu.:0.6300
## Median :0.02000 Median :0.4400 Median :0.0600 Median :0.8500
## Mean :0.05759 Mean :0.4634 Mean :0.1796 Mean :0.7537
## 3rd Qu.:0.05000 3rd Qu.:0.5400 3rd Qu.:0.2300 3rd Qu.:0.9400
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## racePctAsian racePctHisp agePct12t21 agePct12t29
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0400 1st Qu.:0.010 1st Qu.:0.3400 1st Qu.:0.4100
## Median :0.0700 Median :0.040 Median :0.4000 Median :0.4800
## Mean :0.1537 Mean :0.144 Mean :0.4242 Mean :0.4939
## 3rd Qu.:0.1700 3rd Qu.:0.160 3rd Qu.:0.4700 3rd Qu.:0.5400
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## agePct16t24 agePct65up numbUrban pctUrban
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.2500 1st Qu.:0.3000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.2900 Median :0.4200 Median :0.03000 Median :1.0000
## Mean :0.3363 Mean :0.4232 Mean :0.06407 Mean :0.6963
## 3rd Qu.:0.3600 3rd Qu.:0.5300 3rd Qu.:0.07000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## medIncome pctWWage pctWFarmSelf pctWInvInc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2000 1st Qu.:0.4400 1st Qu.:0.1600 1st Qu.:0.3700
## Median :0.3200 Median :0.5600 Median :0.2300 Median :0.4800
## Mean :0.3611 Mean :0.5582 Mean :0.2916 Mean :0.4957
## 3rd Qu.:0.4900 3rd Qu.:0.6900 3rd Qu.:0.3700 3rd Qu.:0.6200
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## pctWSocSec pctWPubAsst pctWRetire medFamInc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3500 1st Qu.:0.1425 1st Qu.:0.3600 1st Qu.:0.2300
## Median :0.4750 Median :0.2600 Median :0.4700 Median :0.3300
## Mean :0.4711 Mean :0.3178 Mean :0.4792 Mean :0.3757
## 3rd Qu.:0.5800 3rd Qu.:0.4400 3rd Qu.:0.5800 3rd Qu.:0.4800
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## perCapInc whitePerCap blackPerCap indianPerCap
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2200 1st Qu.:0.240 1st Qu.:0.1725 1st Qu.:0.1100
## Median :0.3000 Median :0.320 Median :0.2500 Median :0.1700
## Mean :0.3503 Mean :0.368 Mean :0.2911 Mean :0.2035
## 3rd Qu.:0.4300 3rd Qu.:0.440 3rd Qu.:0.3800 3rd Qu.:0.2500
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## AsianPerCap OtherPerCap HispPerCap NumUnderPov
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1900 1st Qu.:0.1700 1st Qu.:0.2600 1st Qu.:0.01000
## Median :0.2800 Median :0.2500 Median :0.3450 Median :0.02000
## Mean :0.3224 Mean :0.2847 Mean :0.3863 Mean :0.05551
## 3rd Qu.:0.4000 3rd Qu.:0.3600 3rd Qu.:0.4800 3rd Qu.:0.05000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## PctPopUnderPov PctLess9thGrade PctNotHSGrad PctBSorMore
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.110 1st Qu.:0.1600 1st Qu.:0.2300 1st Qu.:0.2100
## Median :0.250 Median :0.2700 Median :0.3600 Median :0.3100
## Mean :0.303 Mean :0.3158 Mean :0.3833 Mean :0.3617
## 3rd Qu.:0.450 3rd Qu.:0.4200 3rd Qu.:0.5100 3rd Qu.:0.4600
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctUnemployed PctEmploy PctEmplManu PctEmplProfServ
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2200 1st Qu.:0.3800 1st Qu.:0.2500 1st Qu.:0.3200
## Median :0.3200 Median :0.5100 Median :0.3700 Median :0.4100
## Mean :0.3635 Mean :0.5011 Mean :0.3964 Mean :0.4406
## 3rd Qu.:0.4800 3rd Qu.:0.6275 3rd Qu.:0.5200 3rd Qu.:0.5300
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctOccupManu PctOccupMgmtProf MalePctDivorce MalePctNevMarr
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2400 1st Qu.:0.3100 1st Qu.:0.3300 1st Qu.:0.3100
## Median :0.3700 Median :0.4000 Median :0.4700 Median :0.4000
## Mean :0.3912 Mean :0.4413 Mean :0.4612 Mean :0.4345
## 3rd Qu.:0.5100 3rd Qu.:0.5400 3rd Qu.:0.5900 3rd Qu.:0.5000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## FemalePctDiv TotalPctDiv PersPerFam PctFam2Par
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3600 1st Qu.:0.3600 1st Qu.:0.4000 1st Qu.:0.4900
## Median :0.5000 Median :0.5000 Median :0.4700 Median :0.6300
## Mean :0.4876 Mean :0.4943 Mean :0.4877 Mean :0.6109
## 3rd Qu.:0.6200 3rd Qu.:0.6300 3rd Qu.:0.5600 3rd Qu.:0.7600
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctKids2Par PctYoungKids2Par PctTeen2Par PctWorkMomYoungKids
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4900 1st Qu.:0.530 1st Qu.:0.4800 1st Qu.:0.3900
## Median :0.6400 Median :0.700 Median :0.6100 Median :0.5100
## Mean :0.6207 Mean :0.664 Mean :0.5829 Mean :0.5014
## 3rd Qu.:0.7800 3rd Qu.:0.840 3rd Qu.:0.7200 3rd Qu.:0.6200
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## PctWorkMom NumIlleg PctIlleg NumImmig
## Min. :0.0000 Min. :0.00000 Min. :0.00 Min. :0.00000
## 1st Qu.:0.4200 1st Qu.:0.00000 1st Qu.:0.09 1st Qu.:0.00000
## Median :0.5400 Median :0.01000 Median :0.17 Median :0.01000
## Mean :0.5267 Mean :0.03629 Mean :0.25 Mean :0.03006
## 3rd Qu.:0.6500 3rd Qu.:0.02000 3rd Qu.:0.32 3rd Qu.:0.02000
## Max. :1.0000 Max. :1.00000 Max. :1.00 Max. :1.00000
## PctImmigRecent PctImmigRec5 PctImmigRec8 PctImmigRec10
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1600 1st Qu.:0.2000 1st Qu.:0.2500 1st Qu.:0.2800
## Median :0.2900 Median :0.3400 Median :0.3900 Median :0.4300
## Mean :0.3202 Mean :0.3606 Mean :0.3991 Mean :0.4279
## 3rd Qu.:0.4300 3rd Qu.:0.4800 3rd Qu.:0.5300 3rd Qu.:0.5600
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctRecentImmig PctRecImmig5 PctRecImmig8 PctRecImmig10
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300 1st Qu.:0.0300
## Median :0.0900 Median :0.0800 Median :0.0900 Median :0.0900
## Mean :0.1814 Mean :0.1821 Mean :0.1848 Mean :0.1829
## 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300 3rd Qu.:0.2300
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctSpeakEnglOnly PctNotSpeakEnglWell PctLargHouseFam PctLargHouseOccup
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.7300 1st Qu.:0.0300 1st Qu.:0.1500 1st Qu.:0.1400
## Median :0.8700 Median :0.0600 Median :0.2000 Median :0.1900
## Mean :0.7859 Mean :0.1506 Mean :0.2676 Mean :0.2519
## 3rd Qu.:0.9400 3rd Qu.:0.1600 3rd Qu.:0.3100 3rd Qu.:0.2900
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PersPerOccupHous PersPerOwnOccHous PersPerRentOccHous PctPersOwnOccup
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3400 1st Qu.:0.3900 1st Qu.:0.2700 1st Qu.:0.4400
## Median :0.4400 Median :0.4800 Median :0.3600 Median :0.5600
## Mean :0.4621 Mean :0.4944 Mean :0.4041 Mean :0.5626
## 3rd Qu.:0.5500 3rd Qu.:0.5800 3rd Qu.:0.4900 3rd Qu.:0.7000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## PctPersDenseHous PctHousLess3BR MedNumBR HousVacant
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0600 1st Qu.:0.4000 1st Qu.:0.0000 1st Qu.:0.01000
## Median :0.1100 Median :0.5100 Median :0.5000 Median :0.03000
## Mean :0.1863 Mean :0.4952 Mean :0.3147 Mean :0.07682
## 3rd Qu.:0.2200 3rd Qu.:0.6000 3rd Qu.:0.5000 3rd Qu.:0.07000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## PctHousOccup PctHousOwnOcc PctVacantBoarded PctVacMore6Mos
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.6300 1st Qu.:0.4300 1st Qu.:0.0600 1st Qu.:0.2900
## Median :0.7700 Median :0.5400 Median :0.1300 Median :0.4200
## Mean :0.7195 Mean :0.5487 Mean :0.2045 Mean :0.4333
## 3rd Qu.:0.8600 3rd Qu.:0.6700 3rd Qu.:0.2700 3rd Qu.:0.5600
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## MedYrHousBuilt PctHousNoPhone PctWOFullPlumb OwnOccLowQuart
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3500 1st Qu.:0.0600 1st Qu.:0.1000 1st Qu.:0.0900
## Median :0.5200 Median :0.1850 Median :0.1900 Median :0.1800
## Mean :0.4942 Mean :0.2645 Mean :0.2431 Mean :0.2647
## 3rd Qu.:0.6700 3rd Qu.:0.4200 3rd Qu.:0.3300 3rd Qu.:0.4000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## OwnOccMedVal OwnOccHiQuart RentLowQ RentMedian
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0900 1st Qu.:0.0900 1st Qu.:0.1700 1st Qu.:0.2000
## Median :0.1700 Median :0.1800 Median :0.3100 Median :0.3300
## Mean :0.2635 Mean :0.2689 Mean :0.3464 Mean :0.3725
## 3rd Qu.:0.3900 3rd Qu.:0.3800 3rd Qu.:0.4900 3rd Qu.:0.5200
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## RentHighQ MedRent MedRentPctHousInc MedOwnCostPctInc
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.220 1st Qu.:0.2100 1st Qu.:0.3700 1st Qu.:0.3200
## Median :0.370 Median :0.3400 Median :0.4800 Median :0.4500
## Mean :0.423 Mean :0.3841 Mean :0.4901 Mean :0.4498
## 3rd Qu.:0.590 3rd Qu.:0.5300 3rd Qu.:0.5900 3rd Qu.:0.5800
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## MedOwnCostPctIncNoMtg NumInShelters NumStreet PctForeignBorn
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.2500 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0600
## Median :0.3700 Median :0.00000 Median :0.00000 Median :0.1300
## Mean :0.4038 Mean :0.02944 Mean :0.02278 Mean :0.2156
## 3rd Qu.:0.5100 3rd Qu.:0.01000 3rd Qu.:0.00000 3rd Qu.:0.2800
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## PctBornSameState PctSameHouse85 PctSameCity85 PctSameState85
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4700 1st Qu.:0.4200 1st Qu.:0.5200 1st Qu.:0.5600
## Median :0.6300 Median :0.5400 Median :0.6700 Median :0.7000
## Mean :0.6089 Mean :0.5351 Mean :0.6264 Mean :0.6515
## 3rd Qu.:0.7775 3rd Qu.:0.6600 3rd Qu.:0.7700 3rd Qu.:0.7900
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## LemasSwornFT LemasSwFTPerPop LemasSwFTFieldOps LemasSwFTFieldPerPop
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.06966 1st Qu.:0.2175 1st Qu.:0.9247 1st Qu.:0.2463
## Median :0.06966 Median :0.2175 Median :0.9247 Median :0.2463
## Mean :0.06966 Mean :0.2175 Mean :0.9247 Mean :0.2463
## 3rd Qu.:0.06966 3rd Qu.:0.2175 3rd Qu.:0.9247 3rd Qu.:0.2463
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## LemasTotalReq LemasTotReqPerPop PolicReqPerOffic PolicPerPop
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.09799 1st Qu.:0.2152 1st Qu.:0.3436 1st Qu.:0.2175
## Median :0.09799 Median :0.2152 Median :0.3436 Median :0.2175
## Mean :0.09799 Mean :0.2152 Mean :0.3436 Mean :0.2175
## 3rd Qu.:0.09799 3rd Qu.:0.2152 3rd Qu.:0.3436 3rd Qu.:0.2175
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## RacialMatchCommPol PctPolicWhite PctPolicBlack PctPolicHisp
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.6894 1st Qu.:0.727 1st Qu.:0.2205 1st Qu.:0.1349
## Median :0.6894 Median :0.727 Median :0.2205 Median :0.1349
## Mean :0.6894 Mean :0.727 Mean :0.2205 Mean :0.1349
## 3rd Qu.:0.6894 3rd Qu.:0.727 3rd Qu.:0.2205 3rd Qu.:0.1349
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## PctPolicAsian PctPolicMinor OfficAssgnDrugUnits NumKindsDrugsSeiz
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1149 1st Qu.:0.2592 1st Qu.:0.07555 1st Qu.:0.5561
## Median :0.1149 Median :0.2592 Median :0.07555 Median :0.5561
## Mean :0.1149 Mean :0.2592 Mean :0.07555 Mean :0.5561
## 3rd Qu.:0.1149 3rd Qu.:0.2592 3rd Qu.:0.07555 3rd Qu.:0.5561
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PolicAveOTWorked LandArea PopDens PctUsePubTrans
## Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.306 1st Qu.:0.02000 1st Qu.:0.1000 1st Qu.:0.0200
## Median :0.306 Median :0.04000 Median :0.1700 Median :0.0700
## Mean :0.306 Mean :0.06523 Mean :0.2329 Mean :0.1617
## 3rd Qu.:0.306 3rd Qu.:0.07000 3rd Qu.:0.2800 3rd Qu.:0.1900
## Max. :1.000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## PolicCars PolicOperBudg LemasPctPolicOnPatr LemasGangUnitDeploy
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1631 1st Qu.:0.07671 1st Qu.:0.6986 1st Qu.:0.4404
## Median :0.1631 Median :0.07671 Median :0.6986 Median :0.4404
## Mean :0.1631 Mean :0.07671 Mean :0.6986 Mean :0.4404
## 3rd Qu.:0.1631 3rd Qu.:0.07671 3rd Qu.:0.6986 3rd Qu.:0.4404
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## LemasPctOfficDrugUn PolicBudgPerPop ViolentCrimesPerPop
## Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.1951 1st Qu.:0.070
## Median :0.00000 Median :0.1951 Median :0.150
## Mean :0.09405 Mean :0.1951 Mean :0.238
## 3rd Qu.:0.00000 3rd Qu.:0.1951 3rd Qu.:0.330
## Max. :1.00000 Max. :1.0000 Max. :1.000
cat("Data cleaned. Shape:", nrow(data_clean), "rows and", ncol(data_clean), "columns\n")
## Data cleaned. Shape: 1994 rows and 123 columns
library(corrplot)
## corrplot 0.95 loaded
# Compute correlations among numeric columns
crime_num <- data_clean %>% select(where(is.numeric))
corr_matrix <- cor(crime_num, use = "complete.obs")
# Plot heatmap for top correlated features
corrplot(corr_matrix, type = "lower", tl.cex = 0.2, tl.col = "black")
# Target Variable Distribution
ggplot(data_clean, aes(x = ViolentCrimesPerPop)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Distribution of Violent Crime Rate",
x = "Violent Crimes per Population", y = "Frequency")
ggplot(data_clean, aes(x = PctPopUnderPov, y = ViolentCrimesPerPop)) +
geom_point(alpha = 0.5, color = "purple") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Violent Crime vs Poverty", x = "Poverty (%)", y = "Violent Crime Rate")
## `geom_smooth()` using formula = 'y ~ x'
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
train_index <- createDataPartition(data_clean$ViolentCrimesPerPop, p = 0.8, list = FALSE)
train <- data_clean[train_index, ]
test <- data_clean[-train_index, ]
# Correlation with target
corrs <- sapply(data_clean, function(x) cor(x, data_clean$ViolentCrimesPerPop, use = "complete.obs"))
head(sort(corrs, decreasing = TRUE), 10)
## ViolentCrimesPerPop PctIlleg racepctblack pctWPubAsst
## 1.0000000 0.7379565 0.6312636 0.5746653
## FemalePctDiv TotalPctDiv MalePctDivorce PctPopUnderPov
## 0.5560319 0.5527774 0.5254073 0.5218765
## PctUnemployed PctHousNoPhone
## 0.5042346 0.4882435
summary(data_clean$ViolentCrimesPerPop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.070 0.150 0.238 0.330 1.000
# Check a few relationships
plot(data_clean$medIncome, data_clean$ViolentCrimesPerPop,
main = "Crime vs Median Income",
xlab = "Median Income", ylab = "Violent Crime Rate")
plot(data_clean$pctWWage, data_clean$ViolentCrimesPerPop,
main = "Crime vs Pct Households with Wage Income",
xlab = "pctWWage", ylab = "Violent Crime Rate")
# Build two models
# (a) Linear Regression
lm_model <- lm(ViolentCrimesPerPop ~ ., data = train)
lm_pred <- predict(lm_model, newdata = test)
# (b) Random Forest
rf_model <- randomForest(ViolentCrimesPerPop ~ ., data = train, ntree = 150)
rf_pred <- predict(rf_model, newdata = test)
# Evaluate models
# Calculate R-squared (how well model fits) and RMSE (error)
lm_r2 <- cor(lm_pred, test$ViolentCrimesPerPop)^2
rf_r2 <- cor(rf_pred, test$ViolentCrimesPerPop)^2
lm_rmse <- sqrt(mean((lm_pred - test$ViolentCrimesPerPop)^2))
rf_rmse <- sqrt(mean((rf_pred - test$ViolentCrimesPerPop)^2))
cat("Linear Regression: R2 =", lm_r2, " RMSE =", lm_rmse, "\n")
## Linear Regression: R2 = 0.6788732 RMSE = 0.1298727
cat("Random Forest: R2 =", rf_r2, " RMSE =", rf_rmse, "\n")
## Random Forest: R2 = 0.7044982 RMSE = 0.1251186
# Visualize predictions
results <- data.frame(
Actual = test$ViolentCrimesPerPop,
Predicted = rf_pred
)
ggplot(results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue", alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Random Forest Predictions vs Actual",
x = "Actual Violent Crime Rate",
y = "Predicted Crime Rate")
# Feature importance (which factors matter most)
varImpPlot(rf_model, n.var = 10, main = "Most Important Features")
# Residual Analysis (Model Diagnostics)
residuals_rf <- test$ViolentCrimesPerPop - rf_pred
ggplot(data.frame(residuals_rf), aes(x = residuals_rf)) +
geom_histogram(bins = 30, fill = "darkorange", color = "white") +
labs(title = "Residual Distribution (Random Forest)",
x = "Residuals", y = "Count")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(data_clean, aes(x = PctPopUnderPov, y = ViolentCrimesPerPop)) +
geom_point(alpha = 0.6, color = "purple") +
labs(title = "Interactive: Poverty vs Violent Crime",
x = "Poverty (%)", y = "Violent Crime Rate")
ggplotly(p)